When one examines the literature on criterion-referenced tests, which
we will refer to as CRTs, one quickly notices two phenomena. First, no one 
agrees as to what a CRT is, precisely, in relation to objectives-based tests 
or domain referenced tests, the latter being abbreviated by the term DRTs 
(which may have to do with what some people think of them). Second, all of
the relevant articles essentially deal with such issues as how the items 
should be constructed (Bormuth, 1970; David & Diamond, 1974; Hively, 1973; 
Popham, 1974); how many of them are needed for one to say that an examinee
has "mastered" an objective (Millman, 1973; Novick & Lewis, 1974); and how 
one should determine their reliability (Harris, 1975). Very few researchers
or publishers have seemed to concern themselves with determining whether 
CRTs have empirical validity.

One reason that most CRT developers have ignored empirical validity is that
they seem to assume that the CRT construction process itself will almost 
automatically lead to a content valid test. The typical construction pro- 
cess usually involves identifying a set of supposedly important objectives 
and then developing "good" items to measure these objectives. The "good-
ness" of an item is determined by two factors: (1) consistency of the item
with the objective it is supposed to assess, i.e., does the item measure the
intended objective or some other objective? and (2) technical quality of
the item, i.e., is the item free of technical flaws and biases? If the items
in a CRT are selected so as to assure that the specified domain or objective 
is adequately covered then we are supposed to believe that the CRT itself is 
"good" and useful for a wide range of educational decision making. We will
leave the issue of what constitutes "adequate coverage" of a domain for
another time. Needless to say, what constitutes adequate coverage is not 
immediately apparent either from the literature (Klein & Kosecoff, 1973) on
the subject or an in-depth inspection of available CRTs (Kosecoff & Klein, 
1975). 

To summarize, then, a CRT is generally considered to be valid if its
items cover the specified domain or objective, if the items are judged to 
be appropriate members of the sample of items that might be constructed to
measure that objective in terms of the kinds of skills and content assessed, 
and if the items are free of technical imperfections - a condition which is
by no means guaranteed even by the most rigorous of the current item-writing
rules (Skager, 1975). The foregoing criteria may lead to the development of 
a test that is content valid in that scores on it describe an examinee's 
skills or knowledge. This information, however, is generally not directly 
useful to most users of CRTs, because most CRTs are administered in order to 
gather information that will be used in a wide array of educational decision 
making (Klein, 1970). Such decisions might deal with classroom management 
problems of assigning students to groups for instruction or deciding when a
student or group of students is ready to progress to the next major unit of
instruction. CRTs may also be used for evaluating the effectiveness of edu- 
cational programs and determining the kinds of curriculum that should be 
provided to students. In other words, users of CRTs want to be sure that 
the tests really provide valid information for making these important decisions.

The remainder of this paper will consider four features of CRTs. These 
features represent some of the supposed major advantages that CRTs have in 
comparison to norm-referenced tests (Popham, 1971). For each of these fea- 
tures, we will outline empirical procedures that one might use to assess the 
extent to which a given CRT actually contains these desirable characteris-
tics.

Describing Student Achievement 

One advantage of a CRT is that it is supposed to provide a clear des-
cription of what the student does or does not know or what the student can
or cannot do. Student mastery of a given objective is supposed to be mean-
ingful in and of itself. The CRT accomplishes this by being based on a
very specific objective or set of objectives, and all the items on the
CRT, presumably, are indicators of the extent to which the student has or 
has not mastered that objective. In other words, the items are consistent 
or congruent with the objective. 

One way of ensuring such consistency is to have expert judges indepen-
dently evaluate each individual item to determine whether it actually be-
longs with a given objective or with some other objective (Dahl, 1971).
This determination could be made by having the judges sort all the items 
for a variety of CRTs according to the list of objectives that was used in 
developing these measures. A better technique, however, would be to have
the judges form their own clusters of items and then see whether these 
clusters correspond to the initial set of objectives. Alternatively, one
could have judges infer the objective from an item in terms of the kinds 
of skills and content knowledge that would be required to answer that item.
This inference should closely correspond to the original objective on which 
the item was based. Finally, the construction process itself might be val- 
idated by having two teams of item writers develop items. Judges would be 
given the items written by both teams, in a completely scrambled fashion, 
and then be asked to perform the kinds of tasks noted above in order to
determine item-objective consistency. If the development procedure is ap-
propriate, then the two sets of item writers should produce comparable items
in the sense that the judges do not differentiate between them. 

The foregoing techniques all require expert judges, but as many of us 
have learned, judges are sometimes not as expert as we believe. It may be 
necessary, therefore, to use actual student response data to ensure that an
item is indeed measuring the objective for which it was intended. This 
could be done by using a sample of students who vary in their levels of per-
formance with respect to a variety of CRTs. Factor analyses of these data
would indicate whether the items in a given CRT correlate more highly with 
each other than they do with items in other CRTs. If they do not, then one
would have serious questions about the viability of the CRT as being a good
measure of a well-defined objective. Before one believes all the propaganda 
about the value of CRTs for describing student achievement, then, one should
be certain of the content validity of those CRTs, as established by empirical 
data. 
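
The comparison described above can be illustrated with a much simpler stand-in for a full factor analysis: the mean inter-item correlation within each CRT set against the mean correlation between items from different CRTs. The sketch below is only an illustration under stated assumptions; the item names, the CRT labels, and the 0/1 response data are hypothetical and do not come from any study cited here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score lists.

    Returns 0.0 when either list has no variance.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def _mean(values):
    """Arithmetic mean, or NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")


def within_between(items, crt_of):
    """Compare item correlations within and between CRTs.

    items:  {item_id: [0/1 response per student]}
    crt_of: {item_id: name of the CRT the item belongs to}
    Returns (mean within-CRT r, mean between-CRT r).
    """
    within, between = [], []
    for a, b in combinations(sorted(items), 2):
        r = pearson(items[a], items[b])
        if crt_of[a] == crt_of[b]:
            within.append(r)
        else:
            between.append(r)
    return _mean(within), _mean(between)


# Hypothetical responses: six students, two items per CRT.
items = {
    "add1": [1, 1, 1, 0, 0, 0],
    "add2": [1, 1, 0, 0, 0, 0],
    "frac1": [1, 0, 1, 0, 1, 0],
    "frac2": [1, 0, 1, 0, 1, 1],
}
crt_of = {
    "add1": "addition", "add2": "addition",
    "frac1": "fractions", "frac2": "fractions",
}
w, b = within_between(items, crt_of)
print(f"within={w:.2f} between={b:.2f}")
```

If the within-CRT value is not clearly larger than the between-CRT value, the items are not hanging together as measures of distinct objectives, which is the warning sign the paragraph above describes.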

Curriculum Development

A second supposed advantage of CRTs is that they operationally define 
important en route or component objectives that must be mastered in order
for students to achieve some desirable goal. Teachers and evaluators can 
use CRTs, therefore, as a means for monitoring student progress towards the
achievement of this goal. The importance of such objectives, however, is
generally established by theory and opinion rather than on the basis of em- 
pirical data. 

To the authors' knowledge, there is only one study that has attempted 
to establish the importance of an objective as operationally defined by the
CRT that was used to measure it. In this study, McNeil (1975) divided a
sample of students into two groups - those who could read a series of pas-
sages aloud essentially without error and those who could not perform this
task. He then compared the performance of the two groups on a series of 
CRTs that presumably assessed the component skills needed for performing 
this criterion task of reading the passages. McNeil found that only a few
of the CRTs were able to discriminate between the two groups. On the ba-
sis of these results, he concluded that it may not actually be necessary to
teach certain objectives in order for a student to perform certain criterion 
tasks that are considered important in themselves. 

It is apparent that McNeil used a concurrent validity model to determine 
the relative importance of certain objectives. In so doing, he also vali-
dated the relevance of the CRTs he used to measure those objectives which
were deemed necessary for goal attainment. In other words, the fact that a
given CRT was able to make the necessary discriminations between those who
did versus those who did not master the goal indicates that performance on
that CRT was relevant to that goal. 
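
A concurrent check of this kind can be sketched as the difference in CRT pass rates between students who did and did not master the criterion task. This is a minimal illustration, not McNeil's actual procedure; the function name, the CRT labels, and the pass/fail data are hypothetical.

```python
def discrimination(crt_passed, goal_mastered):
    """Difference in CRT pass rates between the two criterion groups.

    crt_passed and goal_mastered are parallel lists of booleans, one
    entry per student. A value near 0 means the CRT does not separate
    students who mastered the goal from those who did not.
    """
    masters = [p for p, g in zip(crt_passed, goal_mastered) if g]
    others = [p for p, g in zip(crt_passed, goal_mastered) if not g]
    if not masters or not others:
        raise ValueError("need at least one student in each group")
    return sum(masters) / len(masters) - sum(others) / len(others)


# Hypothetical data: eight students, two component-skill CRTs.
goal = [True, True, True, True, False, False, False, False]
crt_a = [True, True, True, False, False, False, True, False]
crt_b = [True, True, False, False, True, True, False, False]

print(discrimination(crt_a, goal))
print(discrimination(crt_b, goal))
```

In this made-up example the first CRT separates the two groups and the second does not, which would raise the question of whether the second objective needs to be taught at all.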

There are a few problems in the McNeil study that other researchers
should be aware of before they try to replicate its approach. One problem
is that a student who has mastered the goal may have forgotten how to per- 
form some of the en route tasks that were required as part of the learning 
process. For example, of those adults who use good grammar, how many of
them are still able to diagram sentences properly? Further, it is also pos-
sible that goal attainment could be achieved in a variety of patterns or
that the criterion measure of the goal itself is faulty. While these prob-
lems are not easy to resolve, it would be well worth the effort, especially
considering how much time is now spent on instructing students so that
they can pass a group of CRTs whose importance is based on conjecture. 

Sensitivity to Instruction 

One of the most highly touted advantages of a CRT is that it is sensi-
tive to the effects of instruction. Teachers are told by program evaluators
that they no longer have to put up with test questions that are not germane
to the particular instructional objectives they are trying to get their stu-
dents to achieve. Thus, it is fair to use CRTs for assessing program out-
comes.

There are two models for empirically establishing sensitivity to in-
struction. The first model focuses on whether specific items within a CRT
differentiate between those who have versus those who have not mastered the 
objective, after they have had instruction in the area to be covered by the 
test. An item that is sensitive is one that students fail prior to instruc- 
tion and pass after instruction (Kosecoff & Klein, 1973). The second model 
focuses on whether the CRT itself is sensitive to instruction in the sense 
that students who receive instruction perform better on the CRT than stu-
dents of comparable ability who do not receive such instruction.
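
The first model can be sketched as a per-item index: the proportion of students passing the item after instruction minus the proportion passing it before. This is one simple way to express the idea, offered as an assumption rather than as the specific index used in the cited work; the item names and responses are hypothetical.

```python
def item_sensitivity(pre, post):
    """Post-instruction pass rate minus pre-instruction pass rate, per item.

    pre and post map each item id to a list of 0/1 responses, one per
    student. Values near 1 indicate an item most students fail before
    instruction and pass afterwards; values near 0 indicate an item
    that does not respond to instruction.
    """
    index = {}
    for item in pre:
        pre_rate = sum(pre[item]) / len(pre[item])
        post_rate = sum(post[item]) / len(post[item])
        index[item] = post_rate - pre_rate
    return index


# Hypothetical responses from four students on two items.
pre = {"q1": [0, 0, 0, 1], "q2": [1, 1, 0, 1]}
post = {"q1": [1, 1, 1, 1], "q2": [1, 1, 0, 1]}

sensitivity = item_sensitivity(pre, post)
print(sensitivity)
```

The second model would use the same arithmetic at the test level, comparing the mean CRT score of instructed students with that of uninstructed students of comparable ability.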

If a CRT fails to show the necessary sensitivity in one instance, one
could argue that it was the fault of the instruction and not the test. But 
if this pattern occurs frequently, one should question the validity of the
CRT itself in terms of its being sensitive enough to detect instructional 
outcomes for such purposes as program evaluation. 

Classroom Management 

The fourth major supposed advantage of CRTs is that they are useful for 
classroom management, especially where some form of individualization of in-
struction is in use. For example, the curriculum may be organized so that
essentially all students proceed through the same sequence of objectives -
such as steps in a given strand of mathematics - but they do so at their
own rate. In this context, CRTs are presumably the ideal tool for checking 
on whether a student is "ready" to move on to the next step. 

The assessment of a CRT's utility for making these kinds of progress
decisions would involve examining its predictive validity. This could be
done by measuring the extent to which students who passed or mastered the
CRT actually performed better in a subsequent instructional unit than stu-
dents who did not pass (Keesling, 1974). Such performance would be indi-
cated by test scores in the subsequent unit and/or by the time it took the
student to master its objective and/or by other relevant indices of compe-
tence. One important side benefit of this kind of validation study is that
it provides an empirical basis for setting mastery levels on CRTs. In other
words, "mastery" could be operationally defined as that performance level at
which one has essentially eliminated such potentially costly classification
errors as saying a student has mastered the objective when he has not.
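
One way to make this definition of "mastery" concrete is to search for the lowest cut score at which the proportion of false masters (students at or above the cut who did not succeed in the subsequent unit) falls to a tolerated level. The sketch below assumes a single success/failure outcome for the later unit; the function name, the default tolerance of 0.05, and the scores are hypothetical.

```python
def lowest_safe_cutoff(scores, succeeded_later, max_false_master_rate=0.05):
    """Lowest CRT cut score whose false-master rate is within tolerance.

    scores:          CRT score per student
    succeeded_later: parallel list of booleans, True if the student
                     performed adequately in the subsequent unit
    Returns None if no observed score meets the tolerance.
    """
    for cut in sorted(set(scores)):
        # Outcomes of every student who would be called a "master".
        outcomes = [ok for s, ok in zip(scores, succeeded_later) if s >= cut]
        false_rate = outcomes.count(False) / len(outcomes)
        if false_rate <= max_false_master_rate:
            return cut
    return None


# Hypothetical data: CRT scores and later success for eight students.
scores = [3, 4, 5, 6, 7, 8, 9, 10]
succeeded = [False, False, False, True, False, True, True, True]

strict_cut = lowest_safe_cutoff(scores, succeeded)
lenient_cut = lowest_safe_cutoff(scores, succeeded, max_false_master_rate=0.25)
print(strict_cut, lenient_cut)
```

A fuller treatment would also weigh the opposite error, holding back a student who was in fact ready, since raising the cut score reduces one kind of error at the expense of the other.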

A predictive validity model could also be used in situations where CRTs 
are employed for grouping students for instruction in the sense that all 
students may not receive the same set of objectives and/or at the same rate 
and/or in the same order. The issue under investigation would again be the 
ability of the CRT to make the relevant distinctions between student perfor-
mance levels. For example, if CRTs really facilitate the forming of effective
groups, then the subsequent overall performance of the classes in which group-
ing occurs should be better than in those classes that do not use CRTs for
this purpose. 

In short, if CRTs are truly useful for making classroom management de-
cisions, then this advantage should be reflected in the performance of stu-
dents. While this may not happen in every instance in which the CRT is
used, because of extraneous factors, there should at least be some in-
dication of its utility when one examines its effectiveness across a vari-
ety of sites.

Summary

In this paper, we have outlined four common uses for CRTs: describing
student achievement, improving curriculum development, being sensitive in-
dicators of the effects of instruction, and facilitating classroom manage-
ment decisions. These uses parallel various forms of empirically establish-
ing the content, concurrent, and predictive validity of the CRTs. What is
disconcerting, however, is that the developers of CRTs have generally not 
conducted such validity studies, or at least they have not reported on them 
in the technical manuals for their CRT systems (Kosecoff & Klein, 1975).
It is time, therefore, for those of us who believe in the value of CRTs to
start providing proof of their utility for the tasks we claim they can per-
form, just as we have required such evidence from the developers of norm-
referenced tests.